ggml-hexagon: flash-attention and reduce-sum optimizations #19141

max-krasnyansky merged 19 commits into ggml-org:master
Conversation
```c
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 4));
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 8));
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 16));
return sum01;
```
Optimize reduction sum by processing two vectors simultaneously.
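The suggestion is to fold the two input vectors together first, then pay for only one rotate-and-add reduction tree. A plain-C sketch of that idea under illustrative names (this models the HVX rotate-and-add pattern in scalar code; it is not the actual kernel):

```c
#include <stddef.h>

#define NLANES 32  /* a 128-byte HVX vector holds 32 fp32 lanes */

/* Scalar model of the reviewed reduction: add the two input vectors once
 * up front, then reduce the single combined vector with log2(NLANES)
 * rotate-and-add steps, instead of reducing each vector separately. */
static float reduce_sum_x2(const float *a, const float *b) {
    float v[NLANES], r[NLANES];
    for (int i = 0; i < NLANES; i++)
        v[i] = a[i] + b[i];                 /* one extra vector add */

    for (int shift = NLANES / 2; shift >= 1; shift /= 2) {
        for (int i = 0; i < NLANES; i++)
            r[i] = v[(i + shift) % NLANES]; /* models Q6_V_vror_VR */
        for (int i = 0; i < NLANES; i++)
            v[i] += r[i];                   /* models the qf32 vadd */
    }
    return v[0];                            /* every lane now holds the total */
}
```

After the log-step loop, all lanes contain the full sum, so any lane can be extracted.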
Overall, good idea and you gave me more ideas to implement/cleanup :).
Interesting. That explains why we need the extra
Yep. QF32 and QF16 have extra bits that are not visible to the SW. Here is the branch where I fixed this issue and also went through and made everything consistently use https://github.com/qualcomm/llama.cpp/tree/hexagon-fa-and-reduce-sum

Tested on Gen 3, 4, 5 and X-Elite. I'm seeing a nice bump in perf across the board. Not huge but significant. Please pull/merge/rebase, see how it does on your setup, and I think we're good to merge.
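The point about QF32's hidden guard bits is essentially about accumulator width: where the qf32 → sf (fp32) conversion happens changes the result. A plain-C illustration of the same effect (not HVX code), with a double accumulator standing in for the wider format:

```c
#include <math.h>

/* Accumulating in plain fp32 rounds on every step; a wider accumulator
 * (double here, qf32's guard bits on HVX) rounds effectively once at
 * the end, so the two disagree measurably over long sums. */
static float sum_f32(int n, float x) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) acc += x;  /* rounds to fp32 every step */
    return acc;
}

static double sum_wide(int n, float x) {
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += x;  /* keeps extra precision */
    return acc;
}
```

Summing `1e-4f` a million times, the fp32 accumulator drifts visibly from 100 while the wide one stays close, which is why keeping `row_sums` consistently in one format matters.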
# Conflicts:
#	ggml/src/ggml-hexagon/htp/hvx-reduce.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
```cmake
file(TO_CMAKE_PATH "${HEXAGON_TOOLS_ROOT}" HEXAGON_TOOLS_ROOT)
if (NOT IS_DIRECTORY "${HEXAGON_TOOLS_ROOT}")
    message(FATAL_ERROR "Make sure HEXAGON_TOOLS_ROOT points to the correct Hexagon SDK installation.")
endif()
```
Thinking it may be good to derive HEXAGON_TOOLS_ROOT from hexagon_sdk.json in HEXAGON_SDK_ROOT.
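One possible shape for that suggestion, sketched in CMake. The `hexagon_sdk.json` key names and the `tools/HEXAGON_Tools/<version>` layout are assumptions about the SDK manifest, not verified against it; `string(JSON)` requires CMake >= 3.19.

```cmake
# Hypothetical sketch: derive HEXAGON_TOOLS_ROOT from the hexagon_sdk.json
# manifest shipped in HEXAGON_SDK_ROOT, if the user did not set it.
if (NOT DEFINED HEXAGON_TOOLS_ROOT AND EXISTS "${HEXAGON_SDK_ROOT}/hexagon_sdk.json")
    file(READ "${HEXAGON_SDK_ROOT}/hexagon_sdk.json" _sdk_json)
    # Key path "tools" -> "version" is an assumed schema; adjust to the real manifest.
    string(JSON _tools_ver ERROR_VARIABLE _json_err GET "${_sdk_json}" "tools" "version")
    if (NOT _json_err)
        set(HEXAGON_TOOLS_ROOT "${HEXAGON_SDK_ROOT}/tools/HEXAGON_Tools/${_tools_ver}")
    endif()
endif()
```

The existing `IS_DIRECTORY` check would then still run afterwards as a sanity check on the derived path.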
…19141)

* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* wip
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* hexagon: optimize reduce-sum for v75+
* hexagon: always keep row_sums in sf/fp32
* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT
* fix compiling error after rebase

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Further to the discussion in PR #19025, this implements the dual-row dot product for flash attention.
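The dual-row idea can be sketched in plain C (illustrative names, not the actual HVX kernels): one pass over the shared operand computes dot products against two rows at once, so each element is loaded once and two independent accumulators stay in flight.

```c
#include <stddef.h>

/* Scalar model of the "rx2" dual-row dot product: x is shared between
 * both rows, and the two accumulators form independent dependency
 * chains, which is what lets the vector units overlap the work. */
static void dot_rx2(const float *x, const float *row0, const float *row1,
                    size_t n, float *out0, float *out1) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float xi = x[i];   /* shared load, used twice */
        acc0 += xi * row0[i];
        acc1 += xi * row1[i];
    }
    *out0 = acc0;
    *out1 = acc1;
}
```

Compared with calling a single-row dot product twice, this halves the loads of `x` and keeps both multiply-accumulate chains busy.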
Key changes
HVX Vector Math Optimizations
* Added `hvx_vec_reduce_sum_qf32x2`, a helper function for efficiently reducing and accumulating two HVX vectors of qf32 values, and refactored several places in the codebase to use this function for dual-accumulation scenarios.
* Introduced new "rx2" (dual accumulation) versions of the dot product functions for both f32-f16 and f16-f16 cases (`hvx_dot_f32_f16_aa_rx2`, `hvx_dot_f16_f16_aa_rx2`), improving performance by processing two accumulations in parallel.
* Refactored the main attention kernel (`flash_attn_ext_f16_thread`) to use the new "rx2" dot product functions when possible, improving block processing efficiency.

Performance
Benchmark on 8 Gen 2, comparing commits `0c21677e4` and `2610805c4` on llama3-1b-q4 (table values not recovered from the source).